Tue Oct 7 05:02:31 2014

Overview

Context

  • Building on SigFuge
    • per-gene analysis
    • cluster by per-base expression
  • extend per-gene clustering
    • better annotations
    • incorporate splicing

Connected Components

  • joint work with Yin Hu, Jan Prins
  • data-driven annotations
  • graphs: nodes (exons), edges (splice junctions)
  • requires specifying parameters
    • thresh_junctionfilter_num_present_samples (8)
    • thresh_junctionfilter_present_support (5)
    • coverageThreshold_exon (.5)
    • coverageThreshold_intron (10)
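
One way to read the two junction filter parameters above is "keep a junction only if at least 8 samples each show at least 5 supporting reads". The sketch below illustrates that reading in base R. The interpretation, the use of `>=` rather than `>`, and the count matrix `counts` are all assumptions made for illustration; they are not taken from the software that produced these slides.

```r
# Hypothetical junction-by-sample read counts (4 junctions x 10 samples).
counts <- rbind(
  j1 = c(95, 249, 32, 40, 51, 60, 72, 88, 90, 12),
  j2 = c(0, 5, 1, 0, 0, 2, 0, 6, 0, 1),
  j3 = c(354, 335, 23, 100, 80, 4, 3, 60, 70, 90),
  j4 = c(5, 5, 5, 5, 5, 5, 5, 5, 0, 0)
)

# Assumed meaning of the two thresholds from the slide.
min_support <- 5  # thresh_junctionfilter_present_support
min_samples <- 8  # thresh_junctionfilter_num_present_samples

# A junction is "present" in a sample if it has enough supporting reads.
present <- counts >= min_support

# Keep junctions that are present in enough samples.
n_present <- rowSums(present)
kept <- counts[n_present >= min_samples, , drop = FALSE]
```

Under this reading, `j2` is dropped because only two samples reach the support threshold, while the other three junctions are kept.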

Connected Components

##       gIdx gStart  gStop kind  start   stop     s1      s2      s3
## 107 gene56 119793 119838    e 119793 119838   0.00   2.283   2.478
## 108 gene57 119919 119953    e 119919 119953   1.00   3.771   2.743
## 109 gene58 121312 179059    e 121312 121573 128.51 185.855 117.916
## 110 gene58 121312 179059    j 121573 121961  95.00 249.000  32.000
## 111 gene58 121312 179059    j 121573 123217   0.00   5.000   1.000
## 112 gene58 121312 179059    e 121574 121960  10.32  27.439   8.796
## 113 gene58 121312 179059    e 121961 122090 314.45 384.592 181.223
## 114 gene58 121312 179059    j 122090 123217 354.00 335.000  23.000
## 115 gene58 121312 179059    e 123217 123282 411.89 352.758 127.409
## 116 gene58 121312 179059    j 123282 123386 301.00 277.000 136.000
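
To make the "nodes (exons), edges (splice junctions)" idea concrete, here is a minimal base R sketch that groups the exon rows of the table above into connected components. It assumes that a junction links the exon ending at its start coordinate to the exon beginning at its stop coordinate, and that directly abutting exons (stop + 1 equals the next start) also belong together. Both rules are assumptions inferred from the table, not the documented algorithm; a graph package such as igraph could replace the hand-written labelling loop.

```r
# Exon and junction coordinates transcribed from the table above.
exons <- data.frame(
  start = c(119793, 119919, 121312, 121574, 121961, 123217),
  stop  = c(119838, 119953, 121573, 121960, 122090, 123282)
)
junctions <- data.frame(
  start = c(121573, 121573, 122090, 123282),
  stop  = c(121961, 123217, 123217, 123386)
)

# Edges from junctions: exon ending at junction start -> exon starting at junction stop.
from <- match(junctions$start, exons$stop)
to   <- match(junctions$stop, exons$start)
keep <- !is.na(from) & !is.na(to)  # drop junctions whose target exon is not listed
edges <- cbind(from[keep], to[keep])

# Assumed extra rule: abutting exons are connected as well.
abut <- match(exons$stop + 1, exons$start)
edges <- rbind(edges, cbind(which(!is.na(abut)), abut[!is.na(abut)]))

# Label propagation: repeatedly give both ends of each edge the smaller label.
label <- seq_len(nrow(exons))
repeat {
  old <- label
  for (k in seq_len(nrow(edges))) {
    label[edges[k, ]] <- min(label[edges[k, ]])
  }
  if (identical(old, label)) break
}

# Exon row indices grouped by component.
components <- split(seq_len(nrow(exons)), label)
```

On this excerpt the result is two single-exon components (matching gene56 and gene57) and one four-exon component (the listed exons of gene58).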

LUSC chr9 Analysis

Lengths of connected components

[Figure: lengths of connected components]

Distribution of median exon expression

Compute the distribution of average expression over each exon, and plot the distributions grouped by the number of exons in each gene.
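
The computation described here could look roughly like the following base R sketch. It reuses the exon rows and the three sample columns (`s1` to `s3`) from the table excerpt shown earlier, so the exon counts per gene reflect only that excerpt; the column names `avg_expr` and `n_exons` are introduced here for illustration.

```r
# Exon rows from the earlier table excerpt (three samples only).
cc <- data.frame(
  gIdx = c("gene56", "gene57", "gene58", "gene58", "gene58", "gene58"),
  kind = "e",
  s1 = c(0.00, 1.00, 128.51, 10.32, 314.45, 411.89),
  s2 = c(2.283, 3.771, 185.855, 27.439, 384.592, 352.758),
  s3 = c(2.478, 2.743, 117.916, 8.796, 181.223, 127.409),
  stringsAsFactors = FALSE
)

exons <- cc[cc$kind == "e", ]

# Average expression of each exon across samples.
exons$avg_expr <- rowMeans(exons[, c("s1", "s2", "s3")])

# Number of exons in the gene (component) that each exon belongs to.
exons$n_exons <- ave(seq_along(exons$gIdx), exons$gIdx, FUN = length)

# One vector of exon-level averages per gene size, ready for plotting.
by_size <- split(exons$avg_expr, exons$n_exons)
```

A call such as `boxplot(by_size, log = "y")` would then draw one distribution per gene size, provided all averages are positive.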

[Figure: distributions of average exon expression, grouped by number of exons per gene]

Location of single exon components

## [1] TRUE

[Figure: location of single exon components]

…contrast with UCSC KnownGenes

[Figure: connected components contrasted with UCSC KnownGenes]

…contrast with UCSC KnownGenes

  • 845 overlaps
  • overlap consisting of:
    • 254 UCSC genes
    • 830 connected components
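
The three counts above are different summaries of the same set of overlapping (gene, component) pairs: the number of pairs, the number of distinct genes involved, and the number of distinct components involved. The base R sketch below shows the distinction on toy intervals; all coordinates are made up for illustration. In a Bioconductor workflow, `findOverlaps()` from GenomicRanges would normally produce the pairs.

```r
# Hypothetical gene and component intervals on one chromosome.
genes <- data.frame(start = c(100, 290, 900), stop = c(300, 700, 950))
comps <- data.frame(start = c(120, 200, 280, 650, 1000),
                    stop  = c(150, 250, 320, 660, 1100))

# All (gene, component) index pairs.
pairs <- expand.grid(gene = seq_len(nrow(genes)), comp = seq_len(nrow(comps)))

# Two closed intervals overlap if each starts before the other ends.
hit <- genes$start[pairs$gene] <= comps$stop[pairs$comp] &
  comps$start[pairs$comp] <= genes$stop[pairs$gene]
overlaps <- pairs[hit, ]

n_overlaps <- nrow(overlaps)               # overlapping pairs
n_genes <- length(unique(overlaps$gene))   # distinct genes with any overlap
n_comps <- length(unique(overlaps$comp))   # distinct components with any overlap
```

Here one component overlaps two genes, so there are more pairs than distinct components, the same pattern as 845 overlaps versus 830 connected components.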

[Figure: overlap between connected components and UCSC KnownGenes]

Example of overlap

[Figure: example of overlap]

  • Gene: SPATA31A5
  • overlaps 39 single-exon connected components

CDKN2A (p16/p14) locus

UCSC Genome Browser view

  • UCSC Genome Browser view of chr9:21,965,000-21,995,000

Connected components in region

[Figure: connected components in the region]

Exon level expression for gene1791

[Figure: exon-level expression for gene1791]

…adding annotations

[Figure: exon-level expression for gene1791, with annotations]

Genomic coordinates

[Figure: genomic coordinates view]

Data Object

  • have only really looked at exon information
  • also have coverage of junctions
  • how to study?
    • separately and aggregate?
    • graph structure?
  • example using gene1791 component

Standard PCA, exon expression

[Figure: PCA of exon expression]

Standard PCA, exon expression

[Figure: PCA of exon expression]

look at clusters

Standard PCA, exon expression

[Figure: PCA of exon expression, clusters]

Standard PCA, junction coverage

[Figure: PCA of junction coverage]
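
The PCA slides above can be sketched with `prcomp()` from base R. The example uses a simulated samples-by-exons matrix (177 samples, 15 exons) in place of the real gene1791 data, and the log transform is an assumption about preprocessing rather than something stated on the slides. The same call would apply to a samples-by-junctions coverage matrix.

```r
set.seed(1)

# Simulated expression: rows are samples, columns are exons (placeholder data).
expr <- matrix(
  rexp(177 * 15, rate = 1 / 50),
  nrow = 177,
  dimnames = list(paste0("sample", 1:177), paste0("exon", 1:15))
)

# PCA on log-transformed, column-centered expression.
pca <- prcomp(log2(expr + 1), center = TRUE, scale. = FALSE)

# Sample coordinates on the first two principal components.
scores <- pca$x[, 1:2]

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

Plotting `scores` (for example with `plot(scores)`) gives the usual PC1 versus PC2 scatter in which sample clusters can be inspected.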

Graph based approaches

[Figure: graph-based approaches]

Graph based approaches

## 15 x 15 sparse Matrix of class "dgCMatrix"
##                                               
##  [1,] 75  . 170  . . .  .  . . .  .  .   . . .
##  [2,]  . 30   .  . . .  .  . . .  .  .   . . .
##  [3,]  .  . 118 24 . .  .  . . .  .  .   . . .
##  [4,]  .  .   . 24 6 .  . 12 . .  3  .   . . .
##  [5,]  .  .   .  . 7 .  .  . . .  .  .   . . .
##  [6,]  .  .   .  . . 5  5  . . .  .  .   . . .
##  [7,]  .  .   .  . . . 12  2 . .  4  .   . . .
##  [8,]  .  .   .  . . .  . 16 . .  .  .   . . .
##  [9,]  .  .   .  . . .  .  . . .  .  .   . . .
## [10,]  .  .   .  . . .  .  . . 1  1  .   . . .
## [11,]  .  .   .  . . .  .  . . . 17  .   . 7 .
## [12,]  .  .   .  . . .  .  . . .  . 61  37 . .
## [13,]  .  .   .  . . .  .  . . .  .  . 109 . .
## [14,]  .  .   .  . . .  .  . . .  .  .   . 6 .
## [15,]  .  .   .  . . .  .  . . .  .  .   . . 4

  • example of an adjacency matrix for ONE sample
  • stacking the matrices for all 177 samples gives a 3-dimensional tensor (15 x 15 x 177)
    • tensor factorizations
    • spectral methods
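
As a rough illustration of the tensor view, the base R sketch below stacks one simulated 15 x 15 matrix per sample into a 15 x 15 x 177 array, unfolds it so that each row holds one sample's matrix, and takes an SVD of the centered unfolding. This is only one simple spectral option, not necessarily the factorization used in this work. The simulated values, the chain-shaped edge pattern, and the choice to put exon values on the diagonal are all assumptions.

```r
set.seed(1)
n_exons <- 15
n_samples <- 177

# Simulate one sample's matrix: exon values on the diagonal (assumed layout),
# junction values between consecutive exons (hypothetical chain structure).
one_sample <- function() {
  A <- matrix(0, n_exons, n_exons)
  diag(A) <- rpois(n_exons, 30)
  A[cbind(1:(n_exons - 1), 2:n_exons)] <- rpois(n_exons - 1, 10)
  A
}

# Stack the per-sample matrices along the third dimension.
tensor <- array(0, dim = c(n_exons, n_exons, n_samples))
for (s in seq_len(n_samples)) tensor[, , s] <- one_sample()

# Mode-3 unfolding: row s is the vectorized matrix of sample s.
unfolded <- matrix(aperm(tensor, c(3, 1, 2)), nrow = n_samples)

# SVD of the column-centered unfolding; keep two left singular vectors.
sv <- svd(scale(unfolded, center = TRUE, scale = FALSE), nu = 2, nv = 0)
sample_scores <- sv$u %*% diag(sv$d[1:2])
```

Each row of `sample_scores` places one sample in a two-dimensional space that reflects both exon and junction values, which could then be clustered.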

Ongoing work

  • understanding connected components better
    • hard rules
    • looking at mismatch density, multimapping, …
  • developing suitable analysis methods
    • clustering graph objects
    • explore alternative formulations

R Environment

Running time:

proc.time() - start_time
##    user  system elapsed 
##  47.536   1.181  48.936

Compiled on:

Sys.time()
## [1] "2014-10-07 05:03:20 EDT"

Session Information:

sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin13.3.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] rmarkdown_0.3.3                         
##  [2] mvnmle_0.1-11                           
##  [3] igraph_0.7.1                            
##  [4] PTAk_1.2-9                              
##  [5] tensor_1.5                              
##  [6] org.Hs.eg.db_2.14.0                     
##  [7] RSQLite_0.11.4                          
##  [8] DBI_0.3.1                               
##  [9] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
## [10] BSgenome.Hsapiens.UCSC.hg19_1.3.1000    
## [11] annotate_1.42.1                         
## [12] SplicingGraphs_1.4.1                    
## [13] Rgraphviz_2.8.1                         
## [14] graph_1.42.0                            
## [15] GenomicAlignments_1.0.6                 
## [16] BSgenome_1.32.0                         
## [17] Rsamtools_1.16.1                        
## [18] Biostrings_2.32.1                       
## [19] XVector_0.4.0                           
## [20] GenomicFeatures_1.16.3                  
## [21] AnnotationDbi_1.26.1                    
## [22] Biobase_2.24.0                          
## [23] GenomicRanges_1.16.4                    
## [24] GenomeInfoDb_1.0.2                      
## [25] IRanges_1.22.10                         
## [26] RColorBrewer_1.0-5                      
## [27] ggbio_1.12.10                           
## [28] BiocGenerics_0.10.0                     
## [29] ggplot2_1.0.0                           
## 
## loaded via a namespace (and not attached):
##  [1] acepack_1.3-3.3          base64enc_0.1-2         
##  [3] BatchJobs_1.4            BBmisc_1.7              
##  [5] BiocParallel_0.6.1       biomaRt_2.20.0          
##  [7] biovizBase_1.12.3        bitops_1.0-6            
##  [9] brew_1.0-6               caTools_1.17.1          
## [11] checkmate_1.4            cluster_1.15.3          
## [13] codetools_0.2-9          colorspace_1.2-4        
## [15] compiler_3.1.1           dichromat_2.0-0         
## [17] digest_0.6.4             evaluate_0.5.5          
## [19] fail_1.2                 foreach_1.4.2           
## [21] foreign_0.8-61           formatR_1.0             
## [23] Formula_1.1-2            gridExtra_0.9.1         
## [25] gtable_0.1.2             Hmisc_3.14-5            
## [27] htmltools_0.2.6          iterators_1.0.7         
## [29] knitr_1.6                labeling_0.3            
## [31] lattice_0.20-29          latticeExtra_0.6-26     
## [33] MASS_7.3-35              Matrix_1.1-4            
## [35] munsell_0.4.2            nnet_7.3-8              
## [37] plyr_1.8.1               proto_0.3-10            
## [39] Rcpp_0.11.3              RCurl_1.95-4.3          
## [41] reshape2_1.4             rpart_4.1-8             
## [43] rtracklayer_1.24.2       scales_0.2.4            
## [45] sendmailR_1.2-1          splines_3.1.1           
## [47] stats4_3.1.1             stringr_0.6.2           
## [49] survival_2.37-7          tools_3.1.1             
## [51] VariantAnnotation_1.10.5 XML_3.98-1.1            
## [53] xtable_1.7-4             yaml_2.1.13             
## [55] zlibbioc_1.10.0